Computing the R of the QR factorization of tall 
and skinny matrices using MPI_Reduce 



Julien Langou 

University of Colorado Denver, Denver CO 80202, USA, 
julien. langou@ucdenver . edu, 
WWW home page: http://www.math.ucdenver.edu/~langou/ 
This work was supported by NSF-CCF (grant #881520). 



Abstract. A QR factorization of a tall and skinny matrix with n columns 
can be represented as a reduction. The operation used along the reduc- 
tion tree has in input two n-by-n upper triangular matrices and in output 
an n-by-n upper triangular matrix which is defined as the R factor of 
the two input matrices stacked the one on top of the other. This opera- 
tion is binary, associative, and commutative. We can therefore leverage 
the MPI library capabilities by using user-defined MPI operations and 
MPUReduce to perform this reduction. The resulting code is compact 
and portable. In this context, the user relies on the MPI library to select 
a reduction tree appropriate for the underlying architecture. 

1 Background 

In |5I6| . we introduced the so-called "Communication- Avoiding algorithms 11 for 
performing the QR factorization of a dense matrix. These algorithms minimize 
the number of communications between computing units (parallel case) and/or 
minimize the number of communications between computing unit and main 
memory (sequential case). 

Communication- Avoiding algorithms fall in the class of (panel factoriza- 
tion) /(update) algorithms. For an m-hy-n matrix, the algorithm perform the 
factorization of the first p columns (called the "panel") and then update the 
remaining n — p columns, and so on. We note that, since p is small with respect 
to m, the panel is tall and skinny. This mechanism is standard and is for exam- 
ple widely used in LAPACK's subroutines. The Communication-Avoiding algo- 
rithms differs from LAPACK ones by the way they handle panel factorizations. 
The main idea is to reformulate the Householder QR factorization algorithm of 
a tall and skinny matrix as a reduction operation (see |5l6j ). The mathematics 
behind these algorithms is slightly different from the textbook ones however they 
are stable. Incidentally we note that actually they are slightly more stable than 
the previous algorithm [SJ. 

The Communication- Avoiding algorithms encompass a wide family of algo- 
rithms depending on which reduction tree is chosen. For example, the "Tile 
Algorithms" are Communication- Avoiding algorithms with a flat reduction tree. 
They are polylogarithmically optimal in the sequential case |5|6j . They have 
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been successfully used in the out-of-core context where I/O time between disk 
and processor is predominant [9]. (We note that the work in [5] predates our 
work of three years.) The tile algorithm also provides a large parallelism which 
makes them very suitable in the multicore context [314113] . They have been im- 
plemented on the IBM Cell Broadband Engine [TT] and on parallel distributed 
machines [14] . 

The 11 Tall and Skinny QR" (TSQR) algorithm is a Communication- Avoiding 
algorithm with a binary reduction tree, it is optimal in the parallel case (see |5I6| ). 
it has been used on tall and skinny matrices in parallel distributed in [6] and in 
the multicore context [10] on more square matrices. 

Recent experiments on the grid [1] have used more original trees in order to 
account for the topology of the underlying network. 

The motivations for this paper are two folds. First, we present the reduction 
operation of our communication avoiding algorithms in the context of MPLreduce, 
we believe this is a pedagogical example that illustrates this reduction in its all 
generality. A binary reduction tree is an option, a flat reduction tree is another 
one, but actually any reduction tree is suitable. (This simply stems from the 
associativity of the binary operation we are using.) This demonstrates the gen- 
erality of the approach. The second goal is to present the MPI community with 
a new and useful reduction operation. MPI currently supports our reduction and 
the two implementations we have tested (MPICH and Open MPI) behave cor- 
rectly. Nevertheless we believe this is an interesting performance optimization 
problem. 

2 A reduction operation 

Figure[T]illustrates the algorithm when a binary tree is used. At the start, we need 
to perform the local QR factorization of our matrices. This is the computational 
step 1 (red circles). Then we "reduce" the four R factors to obtain the final R 
factor. We use a binary reduction tree where communications are shown in blue 
and reduction operations are done with computational step 2 (red circles). At 
each level of the tree, we want to reduce two upper triangular matrices with the 
reduction operation: 

R := qr(.R u R 2 ), 

where the qr operator represents the QR factorization of the two input matrices 
stacked the one on top of the other. The C code to perform this operation is 
given in Table [3j 

This operation is binary. (Trivial: take two upper triangular matrices in input, 
get one upper triangular matrix in output.) We can also prove it is associative, 
that is 

qr(i? 1 ,qr(i? 2 ,i? 3 )) = qr (qr {R u i? 2 ) , R 3 ) . 

This is less obvious and we need to warn the reader that the equal sign above 
needs to be understood as 11 essentially equaV . This comes from the fact that the 
R factor of a QR factorization is essentially unique meaning that it is unique 
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up to multiplication of each row by +1 or -1 (for the real arithmetic case). In 
other words, we have an associativity as good as our QR factorization definition. 
This is therefore completely justified. We note that one can actually impose the 
diagonal of the R factor to have positive signs .7 , in which case we have (strict) 
uniqueness of the The associativity allows us to construct various reduction trees. 

This binary operation is also (essentially) commutative. (We can specify this 
additional property to MPI and MPI can take advantage of it.) 

Independently of the reduction tree used, the stability and the number of 
operations performed remains the same. 

We note that, if each triangular matrix is n-by-n, our reduction operation 
performs 0(n 3 ) operations for 0(n 2 ) data communicated. Reduction operations 
in general have the same number of operations as data communicated. 

3 Implementation 

We have implemented the reduction operation using the MPLreduce and MPLAllrcducc 
functions. This enables us to have a two-line long code, see Table [T] 

lapack_dgeqrf( mloc, n, A, Ida, tau, work, lwork, ftinfo ) ; 
MPI_Allreduce( MPI_IN_PLACE, R, 1, mpi.datatype jnatrixR, 
LILA_MPIOP_qR_UPPER, mpi_comm) ; 

Table 1. Code for performing QR factorization when only the R-factor is 
needed. We use MPLAllreduce to perform the reduction. 



The first step is to perform the local QR factorization (the first blue computa- 
tional step in Figure [I]). This is done by calling the LAPACK routine DGEQRF. 
The second step is to perform the reduction with MPLAllreduce. We use a user- 
defined datatype for upper triangular matrix (mpi_datatype_matrixR). The re- 
duction operation is provided to MPI as a user-defined operation (LILA_MPIOP_QR_UPPER) 
that implements the C code provided in Table[3| The difference between MPI_reduce 
and MPLAllreduce is that, at the end of MPIjreduce, the R factor is on the 
root process only whereas in MPLAllreduce the R factor is redistributed all 
the way down to each process participating in the reduction. The end result of 
MPLallreduce is equivalent to the one of MPLreduce followed by MPLBcast. 

The QR factorization of two triangular matrices one on top of the other as 
implemented by LILA_qr_uppers in Table [3] as two outputs. The first output is 
a triangular matrix R which is stored in place of i?i which contains the R factor 
of the QR factorization, the second output is a triangular matrix V which is 
stored in place of i?2 and which contains the Householder vectors used to create 
the unitary transformation that transforms [R\\R2\ in [i?;0]. 

The present discussion considers the binary operation R =:= qr(i?i,i?2). 
We essentially disregard the need for further use of the Householder vectors V. 
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When only the R factor is needed, this is indeed appropriate. V is not needed 
any longer. 

If we want to compute the Q factor of the QR factorization, we apply the 
transpose of the Householder transformation to the identity matrix. If we want 
to update the trailing matrix during a blocked QR factorization, we apply the 
Householder transformation to the trailing matrix. In these two cases, we need 
to be able to recreate the unitary transformation (or its transpose) used during 
the reduction. This means that we need to store the Householder vectors at each 
step of the reduction tree and we also need to store the shape of the reduction 
tree. Currently MPI docs not allow us to do this easily. The MPI community 
might be interested in rendering these operations more straightforward. 

4 Experimental Results 

Experiments are performed on the Blue Gene/L System (frost.ucar.edu). 
Each compute node and I/O node is a dual-core chip, containing two 700MHz 
PowerPC-440 CPUs, 512MB of memory, and two floating-point units (FPUs) 
per core. Each processor has a peak of 2.8 GFlop/sec. 

For the experiments, instead of using LAPACK DGEQRF as shown in Ta- 
ble [l| we have been using a recursive variant named DGEQR3 8J since this 
latter performs better on this infrastructure. 

Experimental results are presented in Table [2] We compare three implemen- 
tations. The first row represents TSQR with hand made reduction tree (we use 
a binary tree), the second row represents TSQR with reduction performed by 
MPLAllreduce (so this is the exact same code as in Table [lj , the third row 
is ScaLAPACK Householder QR factorization (PDEGQRx: best of PDGEQR2 
and PDGEQRF). (All three codes return numerically correct results.) 

We present the performance in MFlop/sec/proc of the operation. (The num- 
ber of floating point operations is taken as 2mn 2 .) This is a strong scalability 
experiment, the matrix size is kept constant (m = 1,000,000 and n — 50) and 
the number of processors is increased from 32 to 256. 

We observe that TSQR with MPLAllreduce behaves nicely. It outperforms 
the ScaLAPACK Householder QR factorization (PDGEQRx) quite significantly. 
This is due to the fact that we are comparing a Communication Avoiding algo- 
rithm (TSQR) with a non communication avoiding one (PDGEQRx). However 
our hand coded binary tree reduction TSQR outperforms the MPLAllreduce im- 
plementation. (In the future we would like to see which tree the MPLAllreduce 
is selecting. ) 

5 Conclusion 

Using high level language is important for the portability of our codes. The two- 
line code presented in this manuscript describes at a high level an algorithm 
to compute the R of the QR factorization of tall and skinny matrices using a 
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# of processors 32 64 128 256 

1. TSQR 690 666 662 610 

2. TSQR with MPLReduce 420 411 414 392 

3. ScaLAPACK Householder QR 193 190 206 184 

Table 2. Performance in MFlop/scc/proc to compute the R factor of the QR 
factorization. The number of floating point operations is taken as 2mn 2 for all 
operations. The matrix has dimension m = 1,000,000 and n = 50. This is a 
strong scalability experiment, the matrix size is kept constant while the number 
of processors is increased. 



reduction tree. This code can be ported on many architectures and still keeps 
it efficiency providing that the middleware layer (MP1) is able to select the 
appropriate reduction tree for the underlying infrastructure (cluster of multicore 
node, grid, etc.) and the reduction operation. 

In the tall and skinny context, there is one reduction involved total. In the 
general case, however, it is important to realize that each reduction is followed 
by an update, itself followed by a reduction, etc. The choice of the reduction tree 
impacts dramatically subsequent operations. A fiat tree for example enables a 
pipeline of tasks and, for that reason, the flat tree is often preferred in the parallel 
square case. 

Communication algorithms encompasses a wide variety of algorithms that 
are now actively studied. It is important to understand that they derive from 
the same formulation (given in |5l6ll2j L- panel factorization with reduction tree 
and update. 

The optimization of the reduction tree for the Householder QR factorization 
for various matrix shapes, various network topologies, various memory hierar- 
chies is currently a research problem. 

Note. The content of this manuscript was initially presented in [12] but never 
published. The introduction and conclusion have been adapted to reflect recent 
research work. 
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Fig. 1. AllRcduce Householder Algorithm for four processes when only the R- 
factor is requested. 
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int LILA qr uppers (int n, double *R1, double *R2, double *tau, 
double *work ){ 

/* 

* The cost of this operation is 2/3 n~3 to compare with 

* 10/3n~3 (=2mn~2-2/3*n~3 , with m=2n) using a standard Householder code 
* 

* We exploit the fact that: 

* - the two matrices Rl and R2 are triangular 

* - the matrix H is lower triangular 

* The cost comes mainly from step (j.2): 2*(n-j)*j and (j.4): 2*(n-j)*j 

* that you integrate from j=l:n. 
* 

* Purpose 



* 

* Consider the (2N)-by-N matrix: 

* W = [ Rl ] 

* [ R2 ] 
* 

* LILA_qr .uppers performs the QR factorization of W. 
* 

* The output are stored in 

* TAU, the scalars to apply the Householder transformation 

* for further use 

* R2, the upper triangular matrix that holds the Householder 

* vectors . They are represented as : 

* [I ] 

* [ R2 ] 

* Rl , the upper triangular matrix that holds the R factor 
* 

* J. Langou, 2007. 
*/ 

int j ; 

for (j=0;j<n; j++){ 

lapack_dlarfg ( j+2, &(R1 [j*n+j] ) , &(R2[j*n]), 1, &(tau[j])); 
if ((j<n-l)&&(tau[j] != 0.0e+00)){ 

/* 

* w := R2(l: j , j+l:n) ' * v(l:j) + Rl(j , j+1 :n) 
*/ 

cblas dgemv( CblasColMajor , CblasTrans, j + n-j-1, 1.0e+00, 

&(R2[(j + l)*n] ) , n, &(R2[j*n]), 1, 0.0e+00, work, 1 ); 
cblas_daxpy( n-j-1, 1.0e+00, &(R1 [(j + 1) *n+j] ) , n, work, 1); 

/* 

* Rl(j,j + l:n) = Rl(j,j+l:n) - tau * w 

* R2(l: j , j+l:n) = R2 (1 : j , j+1 :n) - tau * v(l:j) * w 
*/ 

cblas_daxpy( n-j-1, -tau[j] , work, 1, &(R1 [(j+l)*n+j] ) , n) ; 
cblas_dger( CblasColMajor, j + 1, n-j-1, -tau[j] , &(R2[j*n]), 1, 
work, 1, &(R2[(j + l)*n]) , n ); 

} 

} 

return 0; 

1 

Table 3. Code for Householder QR factorization of two upper triangular ma- 
trices. 



